


In [1]:

    
%matplotlib inline
import pandas as pd
import numpy as np
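import matplotlib as plt   # aliases the top-level package, so pyplot is reached as plt.pyplot below
import matplotlib.pyplot   # make sure the plt.pyplot submodule is actually loaded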
from IPython.display import display_html

Data source: Titanic: Machine Learning from Disaster (the Kaggle competition)



In [2]:

    
df = pd.read_csv('data/train.csv')
df.head(10)   # print the first 10 rows to inspect the sample data









    Out[2]:

   PassengerId  Survived  Pclass  Name                                               Sex     Age  SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22   1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38   1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26   0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35   1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35   0      0      373450            8.0500   NaN    S
5  6            0         3       Moran, Mr. James                                   male    NaN  0      0      330877            8.4583   NaN    Q
6  7            0         1       McCarthy, Mr. Timothy J                            male    54   0      0      17463             51.8625  E46    S
7  8            0         3       Palsson, Master. Gosta Leonard                     male    2    3      1      349909            21.0750  NaN    S
8  9            1         3       Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)  female  27   0      2      347742            11.1333  NaN    S
9  10           1         2       Nasser, Mrs. Nicholas (Adele Achem)                female  14   1      0      237736            30.0708  NaN    C



In [3]:

    
df.describe(percentiles=[.1, .25, .5, .75, .9])









    Out[3]:

       PassengerId  Survived    Pclass      Age         SibSp       Parch       Fare
count  891.000000   891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean   446.000000   0.383838    2.308642    29.699118   0.523008    0.381594    32.204208
std    257.353842   0.486592    0.836071    14.526497   1.102743    0.806057    49.693429
min    1.000000     0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
10%    90.000000    0.000000    1.000000    14.000000   0.000000    0.000000    7.550000
25%    223.500000   0.000000    2.000000    20.125000   0.000000    0.000000    7.910400
50%    446.000000   0.000000    3.000000    28.000000   0.000000    0.000000    14.454200
75%    668.500000   1.000000    3.000000    38.000000   1.000000    0.000000    31.000000
90%    802.000000   1.000000    3.000000    50.000000   1.000000    2.000000    77.958300
max    891.000000   1.000000    3.000000    80.000000   8.000000    6.000000    512.329200

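What describe tells us:

There are 891 samples in total
Survived: the survival rate is 38%
Pclass: at least half of the passengers are in third class (the median is 3)
Age
- has missing values (only 714 of the 891 rows have an age)
- the mean age is about 30
- the minimum age of 0.42 looks wrong at first, but the Kaggle data dictionary records ages under 1 as fractions, so this is an infant rather than a data error
- a quarter of the passengers are about 20 or younger
SibSp: the median is 0 and the 75th percentile is 1, so between a quarter and a half of the passengers travelled with a spouse or sibling
Parch: the mean is 0.38 per passenger, which is a count and not a share; the 75th percentile is 0, so at most a quarter travelled with a parent or child
Fare
- the minimum is 0
- 3/4 of the passengers paid 31 or less
- the 90th percentile (about 78) and the max (about 512) are far apart, so the max is an outlier and may well be an error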
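The reading above is inferred from summary statistics, so it can help to measure the same things directly. The following is a minimal sketch, not part of the original notebook: `check_observations` is a hypothetical helper name, and the `sample` frame is just the ten rows shown in Out[2], used as a stand-in for the full `df`.

```python
import numpy as np
import pandas as pd

def check_observations(df):
    """Measure directly what the describe() table only hints at."""
    return {
        "missing_age": int(df["Age"].isna().sum()),            # rows without an age
        "share_with_sibsp": float((df["SibSp"] > 0).mean()),  # travelled with spouse/sibling
        "share_with_parch": float((df["Parch"] > 0).mean()),  # travelled with parent/child
        "survival_rate": float(df["Survived"].mean()),        # mean of a 0/1 column
    }

# Stand-in data: the ten rows from Out[2], not the full 891-row training set.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0, 0, 0, 1, 1],
    "Age": [22, 38, 26, 35, 35, np.nan, 54, 2, 27, 14],
    "SibSp": [1, 1, 0, 1, 0, 0, 0, 3, 0, 1],
    "Parch": [0, 0, 0, 0, 0, 0, 0, 1, 2, 0],
})

result = check_observations(sample)
print(result)
```

Calling `check_observations(df)` on the full frame from In [2] would give the actual shares; the numbers from the ten-row sample are not representative of the whole data set.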

Distribution histograms



In [4]:

    
# three panels side by side: Age, Fare, and Fare with the maximum value excluded
fig = plt.pyplot.figure(figsize=(30, 4))
ax = fig.add_subplot(131)
age = df['Age'].dropna()   # Age has missing values; drop them before binning
ax.hist(age, bins=10, range=(age.min(), age.max()))
ax.set_xlabel('Age')
ax.set_ylabel('Age distribution')

ax = fig.add_subplot(132)
ax.hist(df['Fare'], bins=10, range=(df['Fare'].min(), df['Fare'].max()))
ax.set_xlabel('Fare')
ax.set_ylabel('Fare distribution')

ax = fig.add_subplot(133)
s = df['Fare']
ax.hist(s[s < s.max()], bins=10)   # leave out the rows at the maximum fare
ax.set_xlabel('Fare without max')
ax.set_ylabel('Fare distribution')
plt.pyplot.show()

Box plot



In [5]:

    
df.boxplot('Fare', by='Pclass', figsize=(20, 4))









    Out[5]:





<matplotlib.axes.AxesSubplot at 0x7f2116d84a90>
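A box plot draws points beyond 1.5 times the interquartile range from the box as outliers (matplotlib's default `whis=1.5`). The sketch below applies the same rule numerically per class, which may help when deciding whether a large fare is unusual for its class. `flag_outliers` is a hypothetical helper, not something from the notebook, and `sample` again holds only the ten rows from Out[2].

```python
import pandas as pd

def flag_outliers(df, value_col="Fare", group_col="Pclass", k=1.5):
    """Mark values outside [Q1 - k*IQR, Q3 + k*IQR] of their own group."""
    grouped = df.groupby(group_col)[value_col]
    # transform broadcasts each group's quartile back onto its rows
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    return (df[value_col] < q1 - k * iqr) | (df[value_col] > q3 + k * iqr)

# Stand-in data: Pclass and Fare of the ten rows from Out[2].
sample = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3, 3, 1, 3, 3, 2],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583,
             51.8625, 21.075, 11.1333, 30.0708],
})

flags = flag_outliers(sample)
print(sample[flags])   # rows whose fare is unusual within their class
```

With so few rows the quartiles are rough, so the flags from this sample only illustrate the mechanics; `flag_outliers(df)` on the full data is what would correspond to the plot.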



In [6]:

    
grouped = df.groupby('Pclass')

fig = plt.pyplot.figure(figsize=(30, 4))
ax = fig.add_subplot(121)
ax.set_title('Pclass count')
ax.set_xlabel('Pclass')
ax.set_ylabel('Count')
# pass ax explicitly so the bars are drawn on this subplot
grouped.Survived.count().plot(kind='bar', ax=ax)

ax = fig.add_subplot(122)
ax.set_title('Pclass survived')
ax.set_xlabel('Pclass')
ax.set_ylabel('Survived Percentage')
# survivors divided by passengers per class: a fraction between 0 and 1
(grouped.Survived.sum() / grouped.Survived.count()).plot(kind='bar', ax=ax)









    Out[6]:





<matplotlib.axes.AxesSubplot at 0x7f2116c98990>
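Because `Survived` is coded 0/1, dividing the sum by the count in each group is the same as taking the group mean. A small sketch of that equivalence follows, using the ten rows from Out[2] as stand-in data; the variable names are illustrative only.

```python
import pandas as pd

# Stand-in data: Pclass and Survived of the ten rows from Out[2].
sample = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3, 3, 1, 3, 3, 2],
    "Survived": [0, 1, 1, 1, 0, 0, 0, 0, 1, 1],
})

grouped = sample.groupby("Pclass")["Survived"]

rate_by_ratio = grouped.sum() / grouped.count()   # the form used in In [6]
rate_by_mean = grouped.mean()                     # same number, one call

# count, survivors and rate per class in a single table
summary = grouped.agg(["count", "sum", "mean"])
print(summary)
```

The `agg` table puts the count and the rate next to each other, which makes it easier to see when a rate rests on very few passengers.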



In [7]:

    
df2 = pd.crosstab([df.Pclass, df.Sex], df.Survived.astype(bool))
display_html(df2)
df2.plot(kind='bar', stacked=True, color=['red', 'g'], figsize=(20, 5), fontsize=16)









    






  
    
      
Survived        False  True
Pclass  Sex
1       female  3      91
        male    77     45
2       female  6      70
        male    91     17
3       female  72     72
        male    300    47
    
  








    Out[7]:





<matplotlib.axes.AxesSubplot at 0x7f2116c245d0>
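The stacked bars show counts, which makes groups of different size hard to compare. One way to read the crosstab is to turn each row into shares. The sketch below rebuilds the counts shown above by hand and divides each row by its total; the column names `died` and `survived` are stand-ins for the `False` and `True` columns of `df2`.

```python
import pandas as pd

# Counts copied from the crosstab output above.
index = pd.MultiIndex.from_tuples(
    [(1, "female"), (1, "male"),
     (2, "female"), (2, "male"),
     (3, "female"), (3, "male")],
    names=["Pclass", "Sex"],
)
counts = pd.DataFrame(
    {"died": [3, 77, 6, 91, 72, 300], "survived": [91, 45, 70, 17, 72, 47]},
    index=index,
)

# Divide every row by its own total so each row sums to 1.
rates = counts.div(counts.sum(axis=1), axis=0)
survival_rate = rates["survived"].round(3)
print(survival_rate)
```

On the full frame, `pd.crosstab([df.Pclass, df.Sex], df.Survived.astype(bool), normalize='index')` should give the same shares without the manual division.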



In [8]:

    
from IPython.display import FileLink
FileLink('Titanic baby step for pandas Part 2.ipynb')









    Out[8]:




Titanic baby step for pandas Part 2.ipynb
